Is Your Anchor Going Up or Down? Fast and Accurate Supervised Topic Models

نویسندگان

  • Thang Nguyen
  • Jordan L. Boyd-Graber
  • Jeffrey Lund
  • Kevin D. Seppi
  • Eric K. Ringger
چکیده

Topic models provide insights into document collections, and their supervised extensions also capture associated document-level metadata such as sentiment. However, inferring such models from data is often slow and cannot scale to big data. We build upon the “anchor” method for learning topic models to capture the relationship between metadata and latent topics by extending the vector-space representation of word-cooccurrence to include metadataspecific dimensions. These additional dimensions reveal new anchor words that reflect specific combinations of metadata and topic. We show that these new latent representations predict sentiment as accurately as supervised topic models, and we find these representations more quickly without sacrificing interpretability. Topic models were introduced in an unsupervised setting (Blei et al., 2003), aiding in the discovery of topical structure in text: large corpora can be distilled into human-interpretable themes that facilitate quick understanding. In addition to illuminating document collections for humans, topic models have increasingly been used for automatic downstream applications such as sentiment analysis (Titov and McDonald, 2008; Paul and Girju, 2010; Nguyen et al., 2013). Unfortunately, the structure discovered by unsupervised topic models does not necessarily constitute the best set of features for tasks such as sentiment analysis. Consider a topic model trained on Amazon product reviews. A topic model might discover a topic about vampire romance. However, we often want to go deeper, discovering facets of a topic that reflect topic-specific sentiment, e.g., “buffy” and “spike” for positive sentiment vs. “twilight” and “cullen” for negative sentiment. Techniques for discovering such associations, called supervised topic models (Section 2), both produce interpretable topics and predict metadata values. While unsupervised topic models now have scalable inference strategies (Hoffman et al., 2013; Zhai et al., 2012), supervised topic model inference has not received as much attention and often scales poorly. The anchor algorithm is a fast, scalable unsupervised approach for finding “anchor words”—precise words with unique co-occurrence patterns that can define the topics of a collection of documents. We augment the anchor algorithm to find supervised sentiment-specific anchor words (Section 3). Our algorithm is faster and just as effective as traditional schemes for supervised topic modeling (Section 4). 1 Anchors: Speedy Unsupervised Models The anchor algorithm (Arora et al., 2013) begins with a V × V matrix Q̄ of word co-occurrences, where V is the size of the vocabulary. Each word type defines a vector Q̄i,· of length V so that Q̄i,j encodes the conditional probability of seeing word j given that word i has already been seen. Spectral methods (Anandkumar et al., 2012) and the anchor algorithm are fast alternatives to traditional topic model inference schemes because they can discover topics via these summary statistics (quadratic in the number of types) rather than examining the whole dataset (proportional to the much larger number of tokens). The anchor algorithm takes its name from the idea of anchor words—words which unambiguously identify a particular topic. For instance, “wicket” might be an anchor word for the cricket topic. Thus, for any anchor word a, Q̄a,· will look like a topic distribution. Q̄wicket,· will have high probability for “bowl”, “century”, “pitch”, and “bat”; these words are related to cricket, but they cannot be anchor words because they are also related to other topics. Because these other non-anchor words could be topically ambiguous, their co-occurrence must be explained through some combination of anchor words; thus for non-anchor word i,

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Comparative Study of Some Clustering Algorithms on Shape Data

Recently, some statistical studies have been done using the shape data. One of these studies is clustering shape data, which is the main topic of this paper. We are going to study some clustering algorithms on shape data and then introduce the best algorithm based on accuracy, speed, and scalability criteria. In addition, we propose a method for representing the shape data that facilitates and ...

متن کامل

Tandem Anchoring: a Multiword Anchor Approach for Interactive Topic Modeling

Interactive topic models are powerful tools for understanding large collections of text. However, existing sampling-based interactive topic modeling approaches scale poorly to large data sets. Anchor methods, which use a single word to uniquely identify a topic, offer the speed needed for interactive work but lack both a mechanism to inject prior knowledge and lack the intuitive semantics neede...

متن کامل

مدیر موفق کیست؟

Who is a really successful manager? A manager who spends less money, or the one who earns more? A manager who can survive for a longer period of time, or an administrator who expands his organization, and opens up new branches? Which one is the most successful? The article tries to answer these questions and provides, some simple guidlines for the managers in every domain of management who wan...

متن کامل

P14: How to Find a Talent?

Talents may be artistic or technical, mental or physical, personal or social. You can be a talented introvert or a talented extrovert. Learning to look for your talents in the right places and building those talents into skills and abilities might take some work, but going about it creatively will let you explore your natural abilities and find your innate talents. You’re not going to fin...

متن کامل

Weakly Supervised Top-down Salient Object Detection

Top-down saliency models produce a probability map that peaks at target locations specified by a task/goal such as object detection. They are usually trained in a fully supervised setting involving pixel-level annotations of objects. We propose a weakly supervised top-down saliency framework using only binary labels that indicate the presence/absence of an object in an image. First, the probabi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015